*New* The MotherDuck Native Integration is Live on Vercel Marketplace for Embedded Analytics and Data AppsLearn more

GROUP BY clause

Back to DuckDB Data Engineering Glossary

Overview

The GROUP BY clause is a fundamental SQL operation that allows you to organize rows into groups based on one or more columns, enabling aggregate calculations on each group rather than the entire dataset. When you use GROUP BY, each group will contain all rows that share the same values in the specified grouping columns.

Basic Usage

The most common use of GROUP BY involves aggregating data by a single column. For example, to count orders by country:

Copy code

SELECT country, COUNT(*) as order_count FROM orders GROUP BY country;

Multiple Column Groups

You can group by multiple columns to create more specific groupings. The order of columns in the GROUP BY doesn't affect the results:

Copy code

SELECT country, product_category, SUM(revenue) as total_revenue FROM sales GROUP BY country, product_category;

DuckDB-Specific Features

DuckDB extends the standard GROUP BY functionality with helpful features like GROUP BY ALL, which automatically groups by all non-aggregated columns in the SELECT clause:

Copy code

-- These queries are equivalent in DuckDB SELECT country, region, SUM(sales) FROM orders GROUP BY country, region; SELECT country, region, SUM(sales) FROM orders GROUP BY ALL;

Common Pitfalls

When using GROUP BY, all columns in the SELECT clause must either be included in the GROUP BY clause or be wrapped in an aggregate function (like SUM, COUNT, AVG). DuckDB will return an error if this rule is violated, helping prevent accidental misuse of grouping operations.

The GROUP BY clause is often used with:

  • HAVING clause to filter grouped results
  • Window functions for more complex aggregations
  • ORDER BY to sort the grouped results
  • Aggregate functions like COUNT, SUM, AVG, MAX, and MIN

For more advanced grouping operations in DuckDB, look into GROUPING SETS, ROLLUP, and CUBE clauses which provide additional ways to aggregate data at multiple levels simultaneously.